
Improvements in error handling on end-to-end tests #2716

Merged
4 commits merged into Azure:develop from error-handling on Dec 20, 2022

Conversation

narrieta (Member):

Implemented a few improvements in error handling; I'm pointing them out with comments within the PR


codecov bot commented Dec 16, 2022

Codecov Report

Merging #2716 (f0df49e) into develop (d56a3c7) will increase coverage by 0.01%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #2716      +/-   ##
===========================================
+ Coverage    71.95%   71.96%   +0.01%     
===========================================
  Files          104      104              
  Lines        15765    15765              
  Branches      2244     2244              
===========================================
+ Hits         11343    11345       +2     
- Misses        3906     3909       +3     
+ Partials       516      511       -5     
Impacted Files Coverage Δ
azurelinuxagent/common/cgroupconfigurator.py 72.24% <0.00%> (+0.31%) ⬆️


# The same orchestrator machine may be executing multiple suites on the same test VM, or
# the same suite on one or more test VMs; we use this file to mark the build is already done
build_done_path = self._working_directory/"build.done"
if build_done_path.exists():
Contributor:

Did we remove this case because we're no longer executing multiple suites on the same VM?

Member Author:

Not quite, I explained the motivation in the PR comment just a couple of lines above.

@@ -14,9 +14,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#

from collections.abc import Callable
Member Author:

The main change in this module is that I stopped using before_case and after_case for setup and cleanup.

The issue is that errors in before_case are not reported as failures, they simply make LISA skip test execution. Tests can be skipped for valid reasons (e.g. an unsupported OS), but skipping tests on setup failures without reporting an error is not correct, since those errors may go unnoticed on daily runs.

I replaced the use of before_case and after_case with custom code (mainly in the execute() method).
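The resulting pattern can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual AgentTestSuite code; the class and method names other than execute() are hypothetical:

```python
import logging
from typing import Callable


class Scenario:
    """
    Sketch of replacing before_case/after_case with explicit setup and
    cleanup inside execute(): a setup failure now raises an exception that
    is reported as a test error, instead of silently skipping the test.
    """
    def __init__(self, log: logging.Logger) -> None:
        self._log = log

    def _setup(self) -> None:
        self._log.info("Setting up the test scenario")

    def _clean_up(self) -> None:
        self._log.info("Cleaning up the test scenario")

    def execute(self, test: Callable[[], None]) -> None:
        self._setup()  # errors here propagate to the test framework
        try:
            test()
        finally:
            self._clean_up()  # always runs, even if the test fails
```

With before_case, an exception during setup would have marked the case as skipped; here it surfaces as a regular failure.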

)
# E0401: Unable to import 'lisa.sut_orchestrator.azure.common' (import-error)
from lisa.sut_orchestrator.azure.common import get_node_context # pylint: disable=E0401

from azurelinuxagent.common.version import AGENT_VERSION


-class AgentTestSuite(TestSuite):
+class AgentTestScenario(object):
Member Author:

This class used to be a specialization of LISA's TestSuite in order to use the before_case/after_case protocol, but since I am no longer using that protocol, there is no need for the class relationship. This is now a plain AgentTestScenario class (very similar in concept to a DCR scenario).

I renamed the class, but I won't rename the .py file until my next PR; otherwise reviewing the actual code changes would be more difficult.


self._log.info(f"Test Node: {self._vm_name}")
Member Author:

This logic is now in the _setup() method.

self._clean_up()

def _initialize(self, node: Node, log: Logger) -> None:
self._node = node
Member Author:

This is now done in init().

@@ -112,18 +108,10 @@ def _build_agent_package(self) -> Path:
"""
Builds the agent package and returns the path to the package.
"""
build_path = self._working_directory/"build"

# The same orchestrator machine may be executing multiple suites on the same test VM, or
Member Author:

I removed the "*.done" files (they never actually worked, since I remove the working directory on cleanup).

In the current DCR automation we use a new test VM to run a single scenario. In the new automation pipeline I want to be able to run multiple scenarios on the same test VM, and the intention of those "*.done" files was to avoid duplicated setup. However, not only was the implementation incorrect, but the approach was also cumbersome during test development. I will come up with a better approach when I start combining scenarios.


-def _execute_script_on_node(self, script_path: Path, parameters: str = "", sudo: bool = False) -> int:
+def execute_script_on_node(self, script_path: Path, parameters: str = "", sudo: bool = False) -> int:
"""
Member Author:

Made it public, added a docstring, and made minor formatting improvements to its logging.

@@ -15,5 +15,11 @@ tar --exclude='journal/*' --exclude='omsbundle' --exclude='omsagent' --exclude='
/var/lib/waagent/ \
/etc/waagent.conf

# tar exits with 1 on warnings; ignore those
Member Author:

We are hitting a warning about the logs changing while we are creating the tarball. It's just a warning, but tar exits with code 1 on warnings (fatal errors exit with code 2; see the man page for details).
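In Python terms, the pattern is to treat exit code 1 as a warning and fail only on 2 or higher. A minimal sketch, assuming GNU tar is on the PATH; create_tarball is a hypothetical helper, not code from this PR:

```python
import subprocess


def create_tarball(tarball: str, *sources: str) -> None:
    # GNU tar exits with 0 on success, 1 on warnings such as
    # "file changed as we read it", and 2 on fatal errors, so
    # only exit codes above 1 are treated as failures here.
    result = subprocess.run(
        ["tar", "-czf", tarball, *sources],
        capture_output=True, text=True)
    if result.returncode > 1:
        raise Exception(f"tar failed (exit code {result.returncode}): {result.stderr}")
```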

@@ -59,7 +74,7 @@ echo "Restarting service..."
service $service_name stop

# Rename the previous log to ensure the new log starts with the agent we just installed
-mv /var/log/waagent.log /var/log/waagent.pre-install.log
+mv /var/log/waagent.log /var/log/waagent."$(date --iso-8601=seconds)".log
Member Author:

Added a timestamp to the previous log. This is useful if the scenario is run multiple times on the same VM (e.g. during test development) or if multiple scenarios run on the same VM (in the future).

@@ -0,0 +1,31 @@
# Microsoft Azure Linux Agent
Member Author:

Added a "sample" LISA test suite that runs on the local machine. It is useful for quick tests/experiments with LISA.


rmtree(self._context.working_directory.as_posix())
except Exception as exception:
self._log.warning(f"Failed to remove the working directory: {exception}")
self._context.working_directory.mkdir()
Contributor:

So, do we continue the setup if we failed to remove the directory? Or does mkdir raise an exception if it already exists?

Member Author:

Continue with a warning. This is mainly for the dev scenario; the automation runs work on fresh VMs.

nagworld9 (Contributor) commented Dec 20, 2022:

I was reading that mkdir() raises an exception if the directory exists. So, in the dev scenario it won't work if the removal failed.

Contributor:

My point is that we could raise the exception if we failed to remove the directory, because in the next step mkdir throws an exception anyway if the folder exists.

Member Author:

Yes, it will fail, and there will be a warning with info about the delete failure.

The dev will get an error, see the warning in the log, and fix the issue.

Member Author:

> My point is we could raise the exception if we failed to remove because anyway in next step mkdir throws exception if folder exist.

Same difference, with the possibility that the working directory goes away in between the two calls and there is no error.
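The behavior being discussed can be sketched as follows. This is illustrative only; recreate_working_directory is a hypothetical helper mirroring the snippet quoted above, not the actual PR code:

```python
import logging
from pathlib import Path
from shutil import rmtree

log = logging.getLogger("setup")


def recreate_working_directory(working_directory: Path) -> None:
    # A failed removal only logs a warning, and setup continues;
    # if the stale directory is in fact still there, mkdir() then
    # raises FileExistsError, and the warning already in the log
    # points at the root cause of the failure.
    try:
        rmtree(working_directory.as_posix())
    except Exception as exception:
        log.warning(f"Failed to remove the working directory: {exception}")
    working_directory.mkdir()
```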

@narrieta narrieta merged commit a3a41bd into Azure:develop Dec 20, 2022
@narrieta narrieta deleted the error-handling branch December 20, 2022 21:49